Introduction to Machine Learning with PyTorch

ICCS Summer School 2023

Part 1: Some neural-network basics—and fun applications.

Stochastic gradient descent (SGD)

  • Generally speaking, most neural networks are fit/trained using SGD (or some variant of it).

  • To understand the basics of how one might fit a function with SGD, let’s do it with a straight line: \[y=mx+c\]

Fitting a straight line with SGD I

  • Question—when we differentiate a function, what do we get?

  • Consider:

\[y = mx + c\]

\[\frac{dy}{dx} = m\]

  • \(m\) is certainly \(y\)’s slope, but is there a (perhaps) more fundamental way to view a derivative?

Fitting a straight line with SGD II

  • Answer—a function’s gradient (its vector of derivatives) points in the direction of steepest ascent.
  • Consider

\[y = x\]

\[\frac{dy}{dx} = 1\]

  • What is the direction of steepest descent?

\[-\frac{dy}{dx}\]

Fitting a straight line with SGD III

  • To fit a function, we essentially want to create a model which describes data.

  • We therefore need a way of measuring how a model’s predictions deviate from our observations.

  • Consider the data:
| \(x_{i}\) | \(y_{i}\) |
|-----------|-----------|
| 1.0       | 2.1       |
| 2.0       | 3.9       |
| 3.0       | 6.2       |
  • We can measure the distance between \(f(x_{i})\) and \(y_{i}\).

  • Normally we might consider the mean-squared error:

\[L_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - f(x_{i})\right)^{2}\]
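As a quick sanity check, the MSE for the toy data above can be computed by hand—here with plain Python and an illustrative candidate line \(f(x) = 2x + 0.1\) (the slope and intercept are made-up values, not a fitted result):

```python
def mse(ys, preds):
    """Mean-squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

# Candidate model f(x) = 2x + 0.1 (illustrative guess, not the best fit).
m, c = 2.0, 0.1
preds = [m * x + c for x in xs]

loss = mse(ys, preds)
print(loss)  # small, but not zero: the guessed line misses slightly
```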

  • We can differentiate the loss function w.r.t. each parameter of the model \(f\).

  • We can use the direction of steepest descent to iteratively ‘nudge’ the parameters in a direction which reduces the loss.

Fitting a straight line with SGD IV

  • Model: \(f(x) = mx + c\)

  • Data: \(\{x_{i}, y_{i}\}\)

  • Loss:

\[L_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - f(x_{i})\right)^{2} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - (mx_{i} + c)\right)^{2}\]

  • We can iteratively minimise the loss by stepping the model’s parameters in the direction of steepest descent:

\[m_{n + 1} = m_{n} - l_{\text{r}}\frac{dL}{dm}\]

\[c_{n + 1} = c_{n} - l_{\text{r}}\frac{dL}{dc}\]

  • \(l_{\text{r}}\) is a small constant known as the learning rate.
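The whole procedure fits in a few lines of plain Python—a minimal sketch (hand-derived gradients, no PyTorch; the iteration count and learning rate are illustrative choices):

```python
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

m, c = 0.0, 0.0  # start from an arbitrary line
lr = 0.05        # learning rate l_r

n = len(xs)
for _ in range(2000):
    preds = [m * x + c for x in xs]
    # Gradients of L = (1/n) * sum((y - (m*x + c))**2):
    dm = -(2 / n) * sum(x * (y - p) for x, y, p in zip(xs, ys, preds))
    dc = -(2 / n) * sum(y - p for y, p in zip(ys, preds))
    # Step in the direction of steepest descent.
    m -= lr * dm
    c -= lr * dc

print(m, c)  # approaches the least-squares fit for this data
```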

Quick recap

To fit a model we need:

  • A model.

  • Some data.

  • A loss function.

  • An optimisation procedure (often SGD or one of its variants).

All in all, ’tis quite simple.

What about neural networks?

  • Neural networks are just functions.

  • We can ‘train’, or fit, them as we would any other function:

    • by iteratively nudging parameters to minimise the loss.
  • With neural networks, differentiating the loss function is a bit more complicated.

    • but ultimately it’s just the chain rule.
  • We won’t go through any more maths on the matter—learning resources on the topic are in no short supply.

Fully-connected neural networks

  • The simplest commonly used neural networks are called fully-connected neural nets, dense networks, multi-layer perceptrons, or artificial neural networks (ANNs).
  • We map between the features at consecutive layers through matrix multiplication and the application of some non-linear activation function.

\[a_{l+1} = \sigma \left( W_{l}a_{l} + b_{l} \right)\]

  • For common choices of activation functions, see the PyTorch docs.
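The layer equation above maps directly onto PyTorch building blocks—a minimal sketch (layer sizes are illustrative): each `nn.Linear` computes \(W_{l}a_{l} + b_{l}\), and the activation plays the role of \(\sigma\).

```python
import torch
from torch import nn

# A tiny fully-connected network: Linear -> activation -> Linear.
model = nn.Sequential(
    nn.Linear(4, 16),  # a_1 = sigma(W_0 a_0 + b_0)
    nn.ReLU(),         # the non-linear activation sigma
    nn.Linear(16, 3),  # e.g. scores for three classes
)

x = torch.randn(8, 4)  # a batch of 8 four-feature vectors
out = model(x)
print(out.shape)  # torch.Size([8, 3])
```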

Uses: Classification and Regression

  • Fully-connected neural networks are often applied to tabular data.

    • I.e. where it makes sense to express the data in a table-like object (such as a data frame).
    • The input features, and targets, are presented as vectors.
  • Normally people use neural networks for one of two things:

    • Classification: assigning a semantic label to something—i.e. is this a dog or cat?

    • Regression: Estimating a continuous quantity such as mass or volume.

Python and PyTorch

  • In this workshop-lecture-thing, we will implement some straightforward neural networks in PyTorch, and use them for different classification and regression problems.

  • PyTorch is a deep learning framework that can be used in both Python and C++.

    • I have never met anyone actually training models in C++; I find it a bit weird.
  • See the PyTorch website: https://pytorch.org/
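One thing PyTorch buys us: its autograd engine applies the chain rule for us, so we never differentiate the loss by hand. A tiny illustrative example using the straight-line data from earlier (starting from \(m = c = 0\)):

```python
import torch

# Parameters we want gradients for.
m = torch.tensor(0.0, requires_grad=True)
c = torch.tensor(0.0, requires_grad=True)

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.1, 3.9, 6.2])

loss = ((y - (m * x + c)) ** 2).mean()  # the MSE loss
loss.backward()  # autograd fills m.grad and c.grad via the chain rule

print(m.grad, c.grad)
```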

Exercises: Penguins

Exercise 1—classification

  • In this exercise, you will train a fully-connected neural network to classify the species of penguins based on certain physical features.

  • https://github.com/allisonhorst/palmerpenguins

  • Thanks to Jack Atkinson for suggesting this dataset.

Exercise 2—regression

  • In this exercise, you will train a fully-connected neural network to predict the mass of penguins based on other physical features.

  • https://github.com/allisonhorst/palmerpenguins

  • Thanks (again) to Jack Atkinson for suggesting this dataset.

Part 2: Fun with CNNs

Convolutional neural networks (CNNs): why?

Advantages over simple ANNs

  • They require many fewer parameters per layer.
    • The forward pass of a conv layer involves passing a filter—of fixed size—over the inputs.
    • The number of parameters per layer does not depend on the input size.
  • They are a much more natural choice of function for image-like data.
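The parameter saving is easy to see by counting—an illustrative comparison (the channel counts and 64×64 image size are made-up numbers) between a conv layer and a dense layer on the same flattened input:

```python
from torch import nn

# Conv layer: parameter count depends only on kernel size and channels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

# Dense layer on the same 3-channel 64x64 image, flattened to a vector.
dense = nn.Linear(3 * 64 * 64, 16)

n_conv = sum(p.numel() for p in conv.parameters())   # 16*3*3*3 + 16 = 448
n_dense = sum(p.numel() for p in dense.parameters())  # 12288*16 + 16 = 196624
print(n_conv, n_dense)
```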

Some other points

  • Convolutional layers are translationally equivariant: shifting the input simply shifts the output.
    • They don’t care where the “dog” is in the image.
  • Convolutional layers are not rotationally invariant.
    • A model trained to detect correctly-oriented human faces will likely fail on upside-down images, etc.
    • We can address this with data augmentation (explored in exercises).
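One simple augmentation sketch (illustrative—`torchvision.transforms` offers much richer options, but plain `torch.rot90` is enough to show the idea): present each training image in several orientations so the model sees rotated examples too.

```python
import torch

x = torch.randn(4, 3, 32, 32)  # a batch: (batch, channels, height, width)

# Augment with 0, 90, 180 and 270 degree rotations of the spatial dims.
augmented = torch.cat([torch.rot90(x, k, dims=(2, 3)) for k in range(4)])
print(augmented.shape)  # torch.Size([16, 3, 32, 32])
```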

What is a (1D) convolutional layer?

Look at the torch.nn.Conv1d docs
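A quick shape check with `torch.nn.Conv1d` (channel counts are illustrative): with no padding, a kernel of size \(k\) shortens a length-\(L\) signal to \(L - k + 1\).

```python
import torch
from torch import nn

conv = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3)

x = torch.randn(2, 1, 10)  # (batch, channels, length)
out = conv(x)
print(out.shape)  # torch.Size([2, 4, 8]) — length 10 - 3 + 1 = 8
```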

2D convolutional layer

  • Same idea as in one dimension, but in two (funnily enough).
  • Everything else proceeds in the same way as with the 1D case.
  • See the torch.nn.Conv2d docs.
  • As with Linear layers, Conv2d layers also have non-linear activations applied to them.
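A minimal conv “block” sketch (sizes illustrative): `Conv2d` followed by a non-linear activation, exactly as we did with `Linear` layers; `padding=1` keeps the spatial size unchanged for a 3×3 kernel.

```python
import torch
from torch import nn

block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1),
    nn.ReLU(),  # the non-linear activation, as with Linear layers
)

x = torch.randn(1, 3, 28, 28)  # (batch, channels, height, width)
out = block(x)
print(out.shape)  # torch.Size([1, 8, 28, 28])
```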

Typical CNN overview

Exercises

Exercise 1—classification

MNIST hand-written digits.

  • In this exercise we’ll train a CNN to classify hand-written digits in the MNIST dataset.

  • See the MNIST database wiki for more details.

Exercise 2—regression

Random ellipse problem

  • In this exercise, we’ll train a CNN to estimate the centre \((x_{\text{c}}, y_{\text{c}})\) and the \(x\) and \(y\) radii of an ellipse defined by \[ \frac{(x - x_{\text{c}})^{2}}{r_{x}^{2}} + \frac{(y - y_{\text{c}})^{2}}{r_{y}^{2}} = 1 \]

  • The ellipse, and the background, will have random colours chosen uniformly on \(\left[0,\ 255\right]^{3}\).

  • In short, the model must learn to estimate \(x_{\text{c}}\), \(y_{\text{c}}\), \(r_{x}\) and \(r_{y}\).
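One way such a training example could be generated—an illustrative NumPy sketch (the 64×64 image size and the ranges for the centre and radii are made-up choices, not the exercise's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)
h = w = 64

# Random ellipse parameters (the regression targets).
xc, yc = rng.uniform(16, 48, size=2)
rx, ry = rng.uniform(4, 16, size=2)

# Random background and ellipse colours, uniform on [0, 255]^3.
bg, fg = rng.integers(0, 256, size=(2, 3))

# Rasterise: a pixel is inside the ellipse if the LHS of the equation <= 1.
ys, xs = np.mgrid[0:h, 0:w]
mask = (xs - xc) ** 2 / rx**2 + (ys - yc) ** 2 / ry**2 <= 1.0
image = np.where(mask[..., None], fg, bg).astype(np.uint8)

target = np.array([xc, yc, rx, ry])  # what the CNN must predict
print(image.shape, target.shape)
```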

Further information

Slides

Contact

Resources